Abstract

One approach to estimating a species tree from a collection of gene trees is to first
estimate probabilities of clades from the gene trees, and then to construct the 
species tree from the estimated clade probabilities. While a greedy consensus 
algorithm, which consecutively accepts the most probable clades compatible 
with previously accepted clades, can be used for this second stage, this method 
is known to be statistically inconsistent under the multispecies coalescent model. 
This raises the question of whether it is theoretically possible to reconstruct the 
species tree from known probabilities of clades on gene trees. 

We investigate clade probabilities arising from the multispecies coalescent 
model, with an eye toward identifying features of the species tree. Clades on 
gene trees with probability greater than 1/3 are shown to reflect clades on the 
species tree, while those with smaller probabilities may not. Linear invariants 
of clade probabilities are studied both computationally and theoretically, with 
certain linear invariants giving insight into the clade structure of the species 
tree. For species trees with generic edge lengths, these invariants can be used 
to identify the species tree topology. These theoretical results both confirm 
that clade probabilities contain full information on the species tree topology 
and suggest future directions of study for developing statistically consistent 
inference methods from clade frequencies on gene trees. 
1. Introduction

A fundamental problem in evolutionary biology is to determine relative relat- 
edness of species, usually by seeking a rooted tree that diagrammatically depicts 
these relationships. Although phylogenetic methods of inferring relationships 
between genes sampled from individuals in the different species are now highly 
developed, such gene trees are not species trees. Even in the absence of errors 
Figure 1: Gene trees within a species tree. In the multispecies coalescent, gene lineages
sampled from species are assumed to coalesce (form nodes in the gene tree) no more recently
than their most recent common ancestor (MRCA) in the species tree. Coalescence of lineages
in populations more ancient than their MRCA can lead to gene tree topologies that are
discordant with the species tree topology. Using upper case letters for gene lineages sampled
from their corresponding species, failure of the A and B lineages to coalesce in their MRCA
population makes any of the $\binom{3}{2}$ coalescences between A, B, and C equally likely under the
model in the MRCA population of a, b, and c. (a) The gene tree is ((((B,C),A),D),E). (b)
The gene tree is (((B,C),A),(D,E)).



due to estimating gene trees from DNA sequences, gene tree topologies need not 
match the underlying species tree. In recent years, various methods have been 
proposed for inferring species trees from genetic data [6, 9, 12]. Many of these 
methods first estimate gene trees, and then resolve the possible conflicts among 
them to obtain an overall estimate of the species tree. 

An important cause of gene tree conflict is the population effect of incomplete 
lineage sorting, in which gene lineages coalesce in ancestral populations earlier
than the time these lineages first enter a common ancestral population. The 
multispecies coalescent model [16, 19, 17, 7, 6] is commonly used to model this 
process, producing a distribution of rooted gene trees given a rooted species 
tree topology and branch lengths (a measure of time and population size on 
each edge of the species tree). The multispecies coalescent provides a natural 
framework for incorporating population effects, allowing gene trees to possibly 
be discordant with the species tree (see Fig. 1), a phenomenon that is very 
common in multilocus studies [18, 8, 4]. 

Although the distribution of gene tree topologies from the multispecies co- 
alescent determines the species tree [1], estimating this distribution is difficult 
because there are so many possible topologies: $(2n - 3)!!$ when n species are under
study. Thus most topologies are unlikely to be observed among a moderate
number of gene trees. An alternative is to estimate a smaller set of probabilities 
which is a function of gene tree probabilities but that still retains enough information
to identify the species tree. Other works have considered rooted triples
[5, 10, 14] and unrooted gene tree topologies [1, 13]. Another possibility, which
is our focus here, is to use probabilities that a gene tree has a given clade, a set
of leaves descended from a node of the gene tree that is not ancestral to any 
other leaves in the gene tree. The probability of a clade under the multispecies 
coalescent (or any model of gene tree generation) is obtained by simply adding 
the probabilities of all gene trees that display the given clade [5].
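To give a sense of scale for the count $(2n - 3)!!$ quoted above, the following minimal Python sketch evaluates it for a few values of n. The function name `num_rooted_binary_topologies` is our own illustrative choice and is not taken from any software cited here.

```python
def num_rooted_binary_topologies(n: int) -> int:
    """Return (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3), for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least two taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product when n = 2).
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


for n in (4, 5, 10):
    print(n, num_rooted_binary_topologies(n))
```

Already for ten taxa the number of topologies is in the tens of millions, which is why most topologies go unobserved in a sample of moderate size.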

The probability of a clade can be estimated from a collection of gene trees 
by considering the proportion of gene trees displaying the clade. Since this 
procedure does not take into account uncertainty in the gene trees, which are 
themselves estimates from genetic data, a more sophisticated method would 
quantify the uncertainty in the clades by using posterior probabilities or boot- 
strap support values for clades obtained from Bayesian or maximum likelihood 
analyses of the gene trees. The software BUCKy [2], for example, takes this ap- 
proach, using posterior probabilities for clades and additionally incorporating 
a prior distribution for the amount of gene tree conflict to yield a concordance 
factor for each clade. 
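The simple proportion estimator described at the start of this paragraph can be sketched as follows. This is only an illustration under our own assumptions: gene trees are encoded as nested Python tuples of leaf labels, the four input trees are made up, and the helper names `clades_of` and `clade_frequencies` are hypothetical. It does not model the uncertainty handling performed by BUCKy.

```python
from collections import Counter


def clades_of(tree):
    """Return (leaf set, list of clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = clades_of(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)  # the leaves below this node form a clade
    return leaves, clades


def clade_frequencies(gene_trees):
    """Proportion of gene trees displaying each clade."""
    counts = Counter()
    for tree in gene_trees:
        _, clades = clades_of(tree)
        counts.update(set(clades))
    total = len(gene_trees)
    return {clade: c / total for clade, c in counts.items()}


# Four made-up gene trees on the genes A, B, C, D.
gene_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "D")),
    ((("B", "C"), "A"), "D"),
]
freqs = clade_frequencies(gene_trees)
print(freqs[frozenset("AB")], freqs[frozenset("CD")])
```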

One of the most straightforward methods for constructing a species tree 
from clade probabilities is to use greedy consensus, in which the clade with the 
highest probability (or concordance factor) is accepted, provided it is compatible 
with previously accepted clades. This process is repeated until a fully resolved 
tree is formed [3]. This procedure is implemented in BUCKy to construct a 
concordance tree, which is sometimes interpreted as an estimated species tree 
[4]. 
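A minimal sketch of greedy consensus, as described above, is given below. It is not the BUCKy implementation: the function names, the optional `min_prob` cutoff, and the example probabilities are our own assumptions, and only a few clades are listed in the example. Two clades are treated as compatible when they are disjoint or nested.

```python
def compatible(c1, c2):
    """Two clades are compatible if they are disjoint or one contains the other."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def greedy_consensus(clade_probs, min_prob=0.0):
    """Accept clades in order of decreasing probability, skipping incompatible ones.

    Clades with probability <= min_prob are never accepted, so a positive
    cutoff may leave the returned tree unresolved.
    """
    accepted = []
    ranked = sorted(clade_probs.items(), key=lambda item: item[1], reverse=True)
    for clade, prob in ranked:
        if prob <= min_prob:
            break
        if all(compatible(clade, other) for other in accepted):
            accepted.append(clade)
    return accepted


# Hypothetical probabilities for some non-trivial clades on four genes.
clade_probs = {
    frozenset("AB"): 0.5,
    frozenset("ABC"): 0.3,
    frozenset("CD"): 0.2,
}
print(greedy_consensus(clade_probs))                # accepts AB, then ABC
print(greedy_consensus(clade_probs, min_prob=1/3))  # accepts AB only
```

With `min_prob=1/3` the sketch behaves like a consensus restricted to clades of probability greater than 1/3, leaving the remaining relationships unresolved.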

To justify a greedy approach, one needs to investigate whether the most 
probable clades tend also to be clades on the species tree. Indeed, we show 
in Section 4 that under the multispecies coalescent, any clade with probabil- 
ity greater than 1/3 must be on the species tree, suggesting that the standard 
majority-rule consensus (which only accepts clades occurring more than 50% of 
the time) is very conservative in this setting. If the greedy consensus approach 
is used for clades with probability greater than 1/3 (leaving the tree unresolved
with respect to clades with lower probability), then this "not-too-greedy" consensus
approach is not misleading, in the sense that it asymptotically cannot
return a false species tree clade as the number of loci approaches infinity. 

In contrast, previous results have shown that when greedy consensus is ap- 
plied without restrictions on clade probabilities, the returned tree can be mis- 
leading (i.e., for some species trees, as the number of loci increases, the greedy 
consensus method is increasingly likely to produce a tree that disagrees with the 
true species tree) for some sets of branch lengths [5]. These "too-greedy zones"
of edge lengths occur on 4-taxon asymmetric species trees and on any species 
tree topology with five or more leaves. Thus, caution must be used when prob- 
abilities of clades are less than 1/3; it is not obvious how to determine which 
low-probability clades are on the species tree, even if clade probabilities are 
known exactly. Other examples show that the most probable k-clade (a clade
of k > 2 elements) is not necessarily a clade on the species tree, even if the
species tree is known to have a k-clade.

Undeterred by these negative results, we show in Sections 5 and 6 that under 
the multispecies coalescent with one lineage sampled per species, the set of clade
probabilities does identify the species tree topology for generic branch lengths for 
any number of species. The proof is based on discovering a linear combination of 
clade probabilities (a linear invariant) that is equal to zero for any branch lengths 
on any species tree with a given clade. In theory, if clade probabilities are known, 
it is therefore possible to identify the species tree by determining all of its clades. 
While this suggests a statistically consistent method of inferring a species tree 
from estimated clade probabilities, it remains a challenge to incorporate this 
insight into a practical method that outperforms greedy consensus on most 
finite data sets. 

Finally, in Section 6 we extend our results, in part, to cases where the species 
tree is non-binary and where an arbitrary number of lineages is sampled per 
species. 

Although we frame our questions within the framework of the multispecies 
coalescent, a careful reading of our arguments reveals that the essential feature 
of the model that we use is that lineages are exchangeable. If two gene lineages 
are present in the same population at a particular point in time on the species 
tree, then above that point, the model assumes that both lineages behave the 
same way. Much of this work, then, should be robust to variations on the 
coalescent model that preserve exchangeability. 

On a more technical note, there is a key difference in understanding clade 
probabilities versus many other sets of probabilities related to gene trees or 
species trees: the failure of marginalization arguments. As this difference plays 
an important, but unspoken, part throughout this work, we highlight it here. 

The problem of establishing identifiability of a species tree from unrooted
gene tree probabilities that was taken up in [1] is superficially similar to the 
clade problem of this paper. Both unrooted gene tree probabilities and clade 
probabilities can be obtained by summing probabilities of appropriate rooted 
gene trees. The sum is either over all rooted gene trees with the same unrooted 
topology, or over all rooted gene trees that have the clade in question. 

Note also that the probability of a gene tree on a subset of the taxa can 
be obtained by summing probabilities of gene trees on the full set that display 
the given gene tree when restricted to the subset. Such marginalization of a 
gene tree distribution to fewer taxa is possible for either rooted or unrooted 
gene trees. Consequently, for most arguments in [1] it was sufficient to focus on 
small trees, with at most five taxa. Indeed, similar marginalization arguments 
are standard throughout phylogenetic theory. 

Unfortunately, a marginalization approach fails for studying clades when 
the species tree is unknown. Given clade probabilities arising from an n-taxon
species tree $\sigma$ under the multispecies coalescent, one would like to be able to
determine clade probabilities arising from an induced k-taxon tree displayed
on $\sigma$. However, probabilities of clades on the k-taxon induced tree cannot be
obtained from a linear combination of the clade probabilities associated with the
n-taxon species tree in a way that is independent of the species tree topology.
We demonstrate this formally in the case where k = 3 and n = 4 in Appendix A.

This inability to marginalize clade probabilities without knowing the species 
tree topology motivated looking for an invariant that would hold for clades on 
trees of any size. Although only linear invariants are needed in the proof of
identifiability, the invariants constructed for k-clades involve a linear combination
of $2^{k-1}$ clade probabilities. These rather elaborate invariants and the inability
to marginalize clade probabilities to smaller trees lead to a different flavor for 
the proof of species tree identifiability from clade probabilities. 

2. Definitions 

Let X be a finite set, whose elements we refer to as taxa. A species tree on
X means a pair $\sigma = (\psi, \lambda)$, where $\psi$ is a rooted, topological tree whose leaves
are bijectively labelled by elements of X, and $\lambda = (\lambda_1, \ldots, \lambda_m)$ is a collection of
lengths for the internal branches of $\psi$. We refer to $\psi$ as a species tree topology,
and always assume all internal nodes of $\psi$ except the root have degree at least 3.
If all internal nodes except the root have degree 3 and the root has degree 2,
we say that $\psi$ and $\sigma$ are binary.

We use a modified Newick notation for species trees, as in [1], in which we do 
not specify the lengths of pendant edges, since only the lengths of internal edges 
affect probabilities of gene tree topologies under the multispecies coalescent. For
example, we write ((a,b):t,c) for a 3-taxon species tree with one internal edge
with length t, measured in coalescent units. If there is a constant effective 
population size, N, over an edge of the species tree, then a length of t indicates 
that the edge represents Nt generations [6]. For varying effective population 
size, a non-linear scaling is needed to relate coalescent units to generations. 
Species trees are thus not assumed to be ultrametric in coalescent units. 

In discussing trees, we find it convenient in various settings to use either 
spatial or temporal terminology. For instance, if (v,w) is a directed edge in $\psi$
pointing away from the root, then we may say that v is above, or an ancestor, of 
w and that w is below, or a descendant, of v. Natural extensions of these terms 
should be clear from context. 

We denote taxa in X by lower case letters such as a, b, c, .... To distinguish
between taxa and sampled genes from those taxa, we use the corresponding
upper case letters A, B, C, ... to denote the genes, with the set $X_g$ denoting
the full set of genes, one for each taxon. Similarly, a subset of taxa $C \subseteq X$
has a corresponding subset of genes $C_g \subseteq X_g$. A sampled gene tree from the
multispecies coalescent model on $\sigma$ will thus have leaves labelled by $X_g$, and in
general may have any topology, regardless of the species tree topology $\psi$. More
specifically, by a gene tree T we mean a binary, rooted topological tree with
leaves bijectively labelled by $X_g$. We emphasize that for this article gene trees
are topological only, with no edge lengths specified. We require that gene trees
be binary, since under the multispecies coalescent only binary gene trees have
positive probability.
Definition. If $\psi$ is a species tree topology on X, and $A \subseteq X$, then the most
recent common ancestor of A, MRCA(A), is the node of $\psi$ that is ancestral to
all elements of A and which is a descendant of any other node ancestral to all
elements of A.

Definition. Let $\mathrm{desc}_X(v) \subseteq X$ denote the elements of X descended from a
node v of a species tree topology $\psi$ on X, so that if $A \subseteq X$, then
$A \subseteq \mathrm{desc}_X(\mathrm{MRCA}(A)) \subseteq X$. A clade C on $\psi$ is a subset of X such that
$C = \mathrm{desc}_X(\mathrm{MRCA}(C))$.

The notions of MRCA and clade extend to gene trees in an obvious way, 
replacing X, $\psi$, C with $X_g$, T, $C_g$ in the definitions.

Definition. For a gene tree T, the set of all clades on T is denoted $\mathcal{H}(T)$.
Similarly, for a species tree $\sigma = (\psi, \lambda)$ the set of clades on $\psi$ is denoted
$\mathcal{H}(\sigma) = \mathcal{H}(\psi)$.

In discussing the relationships between a subset Y of the taxa X on a tree
$\psi$, we use the terminology of a displayed tree: a tree obtained from the full
tree by first passing to the rooted subtree spanned by Y, and then suppressing
any non-root nodes of degree 2 [20]. As an example, the species tree in Fig. 1
displays ((b,d),e). The notion of displayed trees can be applied in the context
of either species trees (with or without branch lengths) or gene trees. 

A detailed presentation of the multispecies coalescent model is given in [1], so 
we omit repeating that here. Because we focus in this paper on the probabilities 
of observing gene trees or clades on gene trees under that model, we fix the 
following notation. 

Definition. Under the multispecies coalescent model on a fixed species tree $\sigma$
on taxa X, the probabilities of a gene tree T, and a clade $C_g$ on gene trees are
denoted $\mathbb{P}_\sigma(T)$ and $\mathbb{P}_\sigma(C_g)$, respectively.

If more than one lineage is sampled per species, a generalization of our 
results on species tree identifiability still holds. For this extension, we require 
the following definitions. (See Fig. 2 for an example.) 

Definition. Let $X = \{x_1, \ldots, x_n\}$ be a taxon set, with $|X| = n$. Let
$\delta = (\delta_1, \ldots, \delta_n)$ be the number of individuals sampled from species $x_i$, $i = 1, \ldots, n$.
With $x_{ij}$, $1 \le j \le \delta_i$, denoting the individuals in taxon $x_i$, $X^* = \{x_{ij}\}$ is the
set of all sampled individuals, so $|X^*| = \sum_{i=1}^{n} \delta_i$.

An extended species tree $\sigma^* = (\psi^*, \lambda, \delta)$ on X is a species tree $(\psi^*, \lambda)$ on $X^*$
such that for each $1 \le i \le n$ all the leaves $x_{ij}$, $1 \le j \le \delta_i$, have a common parent
in $\psi^*$.

The pruned species tree topology $\psi$ on X is obtained from $\psi^*$ by labeling the
parent of the $x_{ij}$ by $x_i$ for each i with $\delta_i > 1$, and then excising the leaves $x_{ij}$
and the pendant edges on which they lie.

Note that while an extended species tree gives rise to a species tree by the 
pruning process, in an extended species tree a branch length is assigned to those 
Figure 2: (a) A gene tree with multiple lineages sampled from several species within a species
tree. The taxa are $X = \{x_1, x_2, x_3, x_4, x_5\}$ with $\delta = (3, 1, 1, 2, 1)$ lineages sampled from them.
(b) The extended version of the species tree with taxa $X^* = \{x_{11}, x_{12}, \ldots, x_{51}\}$ and one
lineage sampled per taxon. Under the multispecies coalescent, the probability of any clade
$A_g \subseteq X_g$ is the same for both the species trees in (a) and in (b).



edges which become pendant in the species tree whenever there are two or more 
sampled individuals in the taxon. Since our notion of a species tree in this paper 
does not have pendant edge lengths, an extended species tree thus carries more 
edge length information than the associated species tree. 

Gene trees arising from the coalescent model on an extended species tree have
leaves labelled $X_g = \{X_{11}, \ldots, X_{1\delta_1}, \ldots, X_{n1}, \ldots, X_{n\delta_n}\}$ and are, with
probability 1, binary. One readily checks that the probability of such a gene tree under
the multispecies coalescent on the extended species tree is exactly the same as
the probability of the gene tree under a multiple individual sampling scheme on 
the species tree (with some pendant edge lengths) obtained by pruning. Indeed, 
this is why we have introduced such trees. We will use them to easily extend 
results where one individual is sampled per species to the multiple sampling 
situation, in Proposition 13 and Corollary 14. 

Finally, note that by construction, for each $i = 1, \ldots, n$, the set
$A_i = \{x_{i1}, \ldots, x_{i\delta_i}\}$ is a clade on the extended species tree. But of course a set
$(A_i)_g = \{X_{i1}, \ldots, X_{i\delta_i}\}$ need not be a clade on any given gene tree.

3. Arbitrary gene tree distributions 

Though the remainder of this paper is concerned only with the gene tree 
distribution arising from the multispecies coalescent model, in this section we
investigate clade probabilities for arbitrary binary gene tree distributions. The 
main observation is that without special assumptions on the gene tree distribution,
the clade probabilities do not contain enough information to recover the
gene tree distribution.

Note that every gene tree must have as clades all singleton sets of gene labels, 
as well as the full set Xg. We refer to these as trivial clades. Any other subset 
$C_g \subset X_g$ is a clade on some gene trees, but not others.

For an arbitrary distribution of gene trees on a taxon set X, let $\mathbb{P}(T)$ denote
the probability of gene tree T. Then for each subset $C_g \subseteq X_g$, the probability
that $C_g$ is a clade on a gene tree is

$$\mathbb{P}(C_g) = \sum_T \mathbb{P}(C_g \mid T)\,\mathbb{P}(T) = \sum_T I(C_g \in \mathcal{H}(T))\,\mathbb{P}(T),$$

where I is the indicator function with values of 1 or 0. Note that the probability
of any trivial clade is therefore 1.

We emphasize that the clade probabilities for an n-taxon species tree $\sigma$ do
not form a probability distribution. The presence of different clades may not
be mutually exclusive events (for instance, if $C_g \subset C'_g$), and their probabilities
do not sum to 1.

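Proposition 1. If $|X| = n$, then for any distribution of binary gene trees the
sum of the probabilities of all non-trivial clades is $n - 2$.

Proof. Denoting n-taxon gene trees by T,

$$\sum_{\substack{C_g \subset X_g \\ \text{non-trivial}}} \mathbb{P}(C_g)
= \sum_{\substack{C_g \subset X_g \\ \text{non-trivial}}} \sum_T I(C_g \in \mathcal{H}(T))\,\mathbb{P}(T)
= \sum_T \sum_{\substack{C_g \subset X_g \\ \text{non-trivial}}} I(C_g \in \mathcal{H}(T))\,\mathbb{P}(T).$$

But since each binary gene tree has $n - 2$ non-trivial clades, this shows that

$$\sum_{\substack{C_g \subset X_g \\ \text{non-trivial}}} \mathbb{P}(C_g) = \sum_T (n - 2)\,\mathbb{P}(T) = n - 2. \qquad (1)$$

□

Theorem 2. For an arbitrary distribution of binary gene trees on a taxon set
X with $|X| \ge 4$, the gene tree probabilities $\mathbb{P}(T)$ cannot be identified from the
clade probabilities $\mathbb{P}(C_g)$.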



Proof. The set $X_g$ has $2^n - n - 2$ subsets $C_g$ with $2 \le |C_g| \le n - 1$. Using
Proposition 1, the clade probabilities can thus be specified by a point in a
$(2^n - n - 3)$-dimensional vector space. However, there are $(2n - 3)!! = 1 \cdot 3 \cdots (2n - 3)$
binary gene trees on $X_g$, so a gene tree distribution is specified by a point in a
$((2n - 3)!! - 1)$-dimensional vector space. But since

$$(2n - 3)!! - 1 > 2^n - n - 3$$

when $n \ge 4$, and the map from gene tree probabilities to clade probabilities is
linear, the map is not invertible at any point. □
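The dimension comparison in this proof is easy to tabulate. The short Python sketch below (with a helper name, `dimensions`, of our own choosing) prints both dimensions for small n; the two agree at n = 3 and the gene tree dimension is strictly larger from n = 4 onward.

```python
def dimensions(n):
    """Return (dimension of gene tree distributions, dimension of clade probabilities)."""
    num_trees = 1
    for odd in range(3, 2 * n - 2, 2):   # (2n - 3)!! as a product of odd numbers
        num_trees *= odd
    return num_trees - 1, 2 ** n - n - 3


for n in range(3, 9):
    tree_dim, clade_dim = dimensions(n)
    print(n, tree_dim, clade_dim, tree_dim > clade_dim)
```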
We note that for an arbitrary distribution on multifurcating gene tree topologies,
the trivial invariant in Eq. (1) need not hold. However, the argument establishing
Theorem 2 can be modified to apply to such distributions, since the
number of multifurcating trees is greater than the number of binary ones.
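Proposition 1 can also be checked numerically for a small case. The sketch below enumerates all rooted binary gene trees on five genes, assigns them an arbitrary random distribution, and sums the resulting non-trivial clade probabilities. The nested-tuple tree encoding, the function names, and the random weights are assumptions made purely for illustration.

```python
import random


def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to an edge of `tree` or above its root."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def all_rooted_binary_trees(leaves):
    """Build all rooted binary topologies by adding one leaf at a time."""
    trees = [leaves[0]]
    for leaf in leaves[1:]:
        trees = [bigger for tree in trees for bigger in insert_leaf(tree, leaf)]
    return trees


def nontrivial_clades(tree, n):
    """Return the clades of `tree` with between 2 and n - 1 elements."""
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = leaves_below(node[0]) | leaves_below(node[1])
        if len(below) < n:
            found.add(below)
        return below

    leaves_below(tree)
    return found


leaves = ["A", "B", "C", "D", "E"]
trees = all_rooted_binary_trees(leaves)

rng = random.Random(0)
weights = [rng.random() for _ in trees]
weight_sum = sum(weights)

clade_prob = {}
for tree, weight in zip(trees, weights):
    for clade in nontrivial_clades(tree, len(leaves)):
        clade_prob[clade] = clade_prob.get(clade, 0.0) + weight / weight_sum

total = sum(clade_prob.values())
print(len(trees), len(clade_prob), total)
```

The printed total agrees with n - 2 = 3 up to floating-point rounding, whatever weights are used.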

4. Highly probable gene tree clades are species tree clades 

For the remainder of the paper, we assume that both gene tree probabilities 
$\mathbb{P}_\sigma(T)$ and clade probabilities $\mathbb{P}_\sigma(C_g)$ arise from the multispecies coalescent on
a species tree $\sigma = (\psi, \lambda)$.

Theorem 3. Let $\sigma = (\psi, \lambda)$ be a binary species tree on X, with edge lengths
$\lambda_i \ge \epsilon > 0$. Under the multispecies coalescent model, suppose $C_g \subset X_g$ has clade
probability $\mathbb{P}_\sigma(C_g) > (1/3)\exp(-\epsilon)$. Then C is a clade on $\sigma$; that is, $C \in \mathcal{H}(\sigma)$.

Furthermore, if $(1/3)\exp(-\epsilon)$ is replaced with any smaller number, this
statement is no longer true for all such choices of species trees and non-trivial
clades: For any $k < (1/3)\exp(-\epsilon)$, there exists a species tree $\sigma$ on X and a taxon
set $C \subset X$ with $1 < |C| < |X|$ such that C is not a clade on $\sigma$, yet $\mathbb{P}_\sigma(C_g) > k$.

Proof. If C is a trivial clade, there is nothing to show, so we may assume
$1 < |C_g| < |X|$. We prove the contrapositive: if C is not a clade on $\psi$, then
$\mathbb{P}_\sigma(C_g) \le (1/3)\exp(-\epsilon)$.

Suppose C is not a clade on the species tree, so there exist $a, b \in C$ and
$c \in X \setminus C$ such that $\psi$ does not display the rooted triple ((a,b),c). Thus,
the rooted triple probability satisfies $\mathbb{P}_\sigma(((A,B),C)) \le (1/3)\exp(-\epsilon)$ [15]. But
then

$$(1/3)\exp(-\epsilon) \ge \mathbb{P}_\sigma(((A,B),C)) \ge \mathbb{P}_\sigma(C_g),$$

since ((A,B),C) is displayed on every gene tree on which $C_g$ is a clade.

To establish the last claim of the theorem, we construct an example. For
any set C with $1 < |C| < |X|$, pick some $a \in C$, and some $c \in X \setminus C$. Let
$C' = C \setminus \{a\}$. Consider a binary species tree $\sigma$ which has a subtree of the form
$((a,c){:}\delta, T_{C'}{:}\gamma)$, where $T_{C'}$ is any rooted tree on $C'$. Note then that C is not a
clade on $\sigma$.

By taking $\gamma$ to be large, the probability that the lineages from $C'$ coalesce
below $\mathrm{MRCA}(\{a, c\} \cup C')$ can be made as close to 1 as desired. Because the
probability that lineages A and C fail to coalesce within time $\delta$ is $\exp(-\delta)$, by
also choosing $\delta \approx \epsilon$ the probability that three lineages (one for A, one for C,
and one for $C'_g$) enter the ancestral population above this MRCA can be made
as close to $\exp(-\epsilon)$ as we wish. Thus the probability that $C_g$ will be a clade
on a gene tree can be made as close to $(1/3)\exp(-\epsilon)$ as we wish by taking the
branch above this subtree to be long. □
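The rooted triple bound used in this proof can be made concrete with the standard three-taxon formulas: for a species tree ((a,b):t,c), the matching gene triple has probability $1 - (2/3)\exp(-t)$ and each of the other two has probability $(1/3)\exp(-t)$. The Python sketch below simply evaluates these expressions. The function name and the chosen values of t and $\epsilon$ are illustrative assumptions.

```python
import math


def rooted_triple_probs(t):
    """Gene triple probabilities for the species tree ((a,b):t,c), t in coalescent units."""
    discordant = math.exp(-t) / 3.0            # each triple not displayed on the species tree
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,   # triple matching the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


epsilon = 0.1
bound = math.exp(-epsilon) / 3.0
for t in (0.1, 0.5, 2.0):                      # internal edge lengths t >= epsilon
    probs = rooted_triple_probs(t)
    # A non-displayed triple never exceeds (1/3) exp(-epsilon); the displayed one exceeds 1/3.
    print(t, probs["((A,C),B)"] <= bound, probs["((A,B),C)"] > 1.0 / 3.0)
```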

Setting $\epsilon = 0$ yields Corollary 4.

Corollary 4. Let $\sigma$ be a binary species tree on taxa X, with positive edge
lengths. Under the multispecies coalescent model, suppose $C \subset X$ is such that
$\mathbb{P}_\sigma(C_g) > 1/3$. Then C is a clade on $\sigma$.

Furthermore, this statement is no longer true for $1 < |C| < |X|$ if 1/3 is
replaced with any smaller number.

If the species tree is not binary, a slightly weaker result holds. 

Theorem 5. Suppose the species tree $\sigma$ is not necessarily binary, and $C \subset X$
is such that $\mathbb{P}_\sigma(C_g) > 1/3$. Then C is a clade on $\sigma$.

Furthermore, this statement is no longer true for $1 < |C| < |X|$ if 1/3 is
replaced with any smaller number.

Proof. To show C is a clade, we suppose $c \in X \setminus C$ and demonstrate that
$c \notin \mathrm{desc}_X(\mathrm{MRCA}(C))$. Choose $a, b \in C$ such that MRCA(C) = MRCA({a, b}). Note
that $\mathbb{P}_\sigma(((A,B),C)) \ge \mathbb{P}_\sigma(C_g) > 1/3$ implies that the rooted triple ((a,b),c) is
displayed on $\sigma$. Thus $c \notin \mathrm{desc}_X(\mathrm{MRCA}(C))$.

That 1/3 cannot be replaced with a smaller number is a consequence of 
Corollary 4. □ 

5. Clade invariants 

A clade invariant for a species tree topology is a polynomial in the proba- 
bilities of clades on gene trees that vanishes for all edge length assignments to 
the species tree. More completely, a clade invariant associated to an n-taxon 
species tree topology $\psi$ is a multivariate polynomial in $2^n - n - 2$ indeterminates
(one for every non-trivial clade) which evaluates to zero at any vector of clade
probabilities $\mathbb{P}_\sigma(C_g)$ arising from $\sigma = (\psi, \lambda)$, regardless of the values of $\lambda$.

Proposition 1 gives an example of a clade invariant for binary gene trees
that, in addition, is independent of all features of $\psi$ except the number of taxa:

$$\sum_{\substack{C_g \subset X_g \\ \text{non-trivial}}} \mathbb{P}_\sigma(C_g) - (n - 2) = 0.$$

We call this the trivial invariant, and emphasize it is satisfied by clade probabilities
from any species tree on X.

Clade invariants can be computed for small trees using computational algebra
software, such as Singular [11]. For each edge length $\lambda_i$, one sets
$\Lambda_i = \exp(-\lambda_i)$, and then expresses the clade probabilities as multivariate polynomials
in the $\Lambda_i$. Gröbner basis methods for variable elimination then allow one to
determine generators of the polynomial ideal of all clade invariants. Such computations
were useful in formulating the general construction of certain linear
invariants given below. The existence of these clade invariants forms the basis
of our proof of species tree topology identifiability in Section 6.

Theorem 6. Let $A \subset X$ be a subset of taxa with at least two elements, and
$C \subseteq X \setminus A$ a non-empty set of taxa not in A. For distinct $a, b \in A$, let
$A' = A \setminus \{a, b\}$. Then if A is a clade on $\sigma$,

$$\sum_{S \subseteq A'} \mathbb{P}_\sigma(S_g \cup \{A\} \cup C_g) - \sum_{S \subseteq A'} \mathbb{P}_\sigma(S_g \cup \{B\} \cup C_g) = 0. \qquad (2)$$
We note that this theorem applies to any species tree, including non-binary
ones. Moreover, since a non-binary species tree $\sigma$ can be thought of as any of
its binary resolutions with length 0 assigned to any introduced edges, the clade
probabilities arising from such a $\sigma$ will satisfy the polynomials of the theorem
for every binary resolution. Thus in the statement of the theorem the phrase 'if
A is a clade on $\sigma$' can be replaced with 'if A is a clade on a binary resolution
of $\sigma$.'

For the proof, it is useful to have the notion of compatible clades: 
Definition. Two clades $A_g$ and $B_g$ are compatible if $A_g \cap B_g = \emptyset$, $A_g \subseteq B_g$, or
$B_g \subseteq A_g$.

If a clade Ag is on a gene tree T, then all other clades appearing on T must 
be compatible with Ag. 

The proof of Theorem 6 uses partitions of subsets of the taxon set X that 
occur as follows: Consider an internal node v of $\sigma$, and let $A = \mathrm{desc}_X(v)$. Then
in a realization of the coalescent process on $\sigma$, some of the lineages of genes in
$A_g$ may coalesce below v, so that there are $|A|$ or fewer lineages at v. Each such
lineage determines a subset of $A_g$, namely its descendants, and hence the set of
lineages determines a partition of A.

As an example, consider the species tree in Fig. 1. For the set A = {a, b, c},
the partition at MRCA(A) in both subfigures is {{a}, {b}, {c}}. Note that the
partition of such a set A is not affected by any coalescent events occurring in
the MRCA population, but only by those below. The only other partition of
A possible for this species tree is {{a, b}, {c}}. For the set A = {a, b, c, d},
the partition at MRCA(A) in Fig. 1a is {{a}, {b, c}, {d}}, and in Fig. 1b is
{{a, b, c}, {d}}.

Proof of Theorem 6. Suppose A is a clade on $\psi$, with $v = \mathrm{MRCA}(A)$. Letting
$\pi(A) = A_1 | \cdots | A_k$ denote a partition of A, we also use $\pi(A)$ to denote the event
that the coalescent process on $\sigma$ produces lineages at v defining this partition.

We will condition on this event: Specifically, recalling the notion of a coalescent
history from [7],

$$\mathbb{P}_\sigma(\pi(A)) = \sum_T \sum_{\substack{\text{history } h_T \\ \text{consistent with } \pi(A)}} \mathbb{P}_\sigma(T, h_T). \qquad (3)$$

For $B \subseteq X$ the joint probability $\mathbb{P}_\sigma(B_g, \pi(A))$ is computed similarly, by restricting
the outer sum on the right side of Eq. (3) to those gene trees that have clade
$B_g$. Then

$$\mathbb{P}_\sigma(B_g \mid \pi(A)) = \frac{\mathbb{P}_\sigma(B_g, \pi(A))}{\mathbb{P}_\sigma(\pi(A))},$$

and by the law of total probability, we have the clade probability

$$\mathbb{P}_\sigma(B_g) = \sum_{\pi(A)} \mathbb{P}_\sigma(B_g \mid \pi(A))\,\mathbb{P}_\sigma(\pi(A)).$$
Thus, to establish Eq. (2), it is enough to show that

$$\left( \sum_{S \subseteq A'} \mathbb{P}_\sigma(S_g \cup \{A\} \cup C_g \mid \pi(A)) \right) - \left( \sum_{S \subseteq A'} \mathbb{P}_\sigma(S_g \cup \{B\} \cup C_g \mid \pi(A)) \right) = 0 \qquad (4)$$

holds for all choices of partition $\pi(A)$.

To establish Eq. (4), we show that non-zero terms cancel pairwise. How- 
ever, which terms cancel depends on the partition, so for the remainder of the 
argument we fix $\pi(A)$, and assume the partition sets are indexed so that $a \in A_1$.

Note first that if $b \in A_1$ as well, then we are conditioning on an event that
requires that the A and B lineages have coalesced into one below v. Thus, any 
clade on a gene tree that includes A and Cg must include B, because we have 
assumed that C is non-empty. Similarly, any clade that includes B and Cg must 
include A. Therefore, all probabilities in Eq. (4) are zero, so the equation holds. 

Otherwise, assume $b \in A_2$. We wish to give a bijective correspondence
between non-zero clade probabilities in the first sum in Eq. (4) and equal clade
probabilities in the second sum, with the correspondence dependent on the
partition $\pi(A)$. That is, we wish to show that for each $S_1 \subseteq A'$, there is a
corresponding $S_2 \subseteq A'$ such that

$$\mathbb{P}_\sigma((S_1)_g \cup \{A\} \cup C_g \mid \pi(A)) = \mathbb{P}_\sigma((S_2)_g \cup \{B\} \cup C_g \mid \pi(A)). \qquad (5)$$

Consider first the case when $(S_1)_g \cup \{A\} \cup C_g$ is compatible with the clades
$(A_1)_g, \ldots, (A_k)_g$. Because C is non-empty, this occurs exactly when $S_1 \cup \{a\}$ is
the union of some of the $A_i$. Thus we have

$$S_1 \cup \{a\} = A_1 \cup \bigcup_j A_{i_j},$$

for some $i_j$, with all unions here disjoint. Moreover, since $b \notin S_1 \cup \{a\}$, $A_2$
does not appear in this expression. We therefore define $S_2$ by the expression
of disjoint unions

$$S_2 \cup \{b\} = A_2 \cup \bigcup_j A_{i_j}.$$

Eq. (5) then holds, since for the coalescent process on $\sigma$ above v the lineages
corresponding to $A_1$ and $A_2$ are exchangeable. This gives us a bijection between
$S_1 \subseteq A'$ and $S_2 \subseteq A'$ for which either (and hence both) of the probabilities in
Eq. (5) are non-zero.

For all other $S_1$, $S_2$, the sets $(S_1)_g \cup \{A\} \cup C_g$ and $(S_2)_g \cup \{B\} \cup C_g$ are not
compatible with $\pi(A)$, and hence these probabilities are zero. □
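As an informal numerical check of Theorem 6 (not a substitute for the proof), one can simulate gene tree topologies under the multispecies coalescent and estimate clade probabilities by their frequencies. The sketch below does this for a 4-taxon caterpillar species tree with A = {a, b, c}, the distinguished pair a, c, and C = {d}. The tree encoding (nested tuples carrying the length of the edge above each internal node), the branch lengths 0.5 and 0.3, the seed, and all function names are our own assumptions.

```python
import math
import random
from collections import Counter


def coalesce(lineages, length, rng, clades):
    """Run the coalescent among `lineages` for `length` coalescent units."""
    lineages = list(lineages)
    remaining = length
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2.0)   # total coalescence rate is C(k, 2)
        if wait > remaining:
            break                                    # no further event on this edge
        remaining -= wait
        i, j = rng.sample(range(k), 2)               # exchangeable lineages: uniform random pair
        merged = lineages[i] | lineages[j]
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        clades.add(merged)                           # each coalescence creates a gene tree clade
    return lineages


def lineages_above(node, rng, clades):
    """Lineages surviving to the top of the edge above `node`."""
    if isinstance(node, str):
        return [frozenset([node])]                   # one sampled gene per species
    left, right, length = node
    entering = lineages_above(left, rng, clades) + lineages_above(right, rng, clades)
    return coalesce(entering, length, rng, clades)


def estimate_clade_probs(species_tree, n_taxa, replicates, seed=1):
    """Estimate non-trivial clade probabilities by simulation."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        clades = set()
        lineages_above(species_tree, rng, clades)
        for clade in clades:
            if 1 < len(clade) < n_taxa:
                counts["".join(sorted(clade))] += 1
    return {name: c / replicates for name, c in counts.items()}


# Caterpillar (((a,b):0.5,c):0.3,d); the root population lasts forever.
caterpillar = ((("A", "B", 0.5), "C", 0.3), "D", math.inf)
p = estimate_clade_probs(caterpillar, 4, 20000)
invariant = (p.get("AD", 0) + p.get("ABD", 0)) - (p.get("CD", 0) + p.get("BCD", 0))
print(round(invariant, 3))
```

Up to sampling error, the estimated value of the invariant should be close to zero, while the individual clade frequencies in it are not.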

As a simple corollary, we immediately obtain what we call 'cherry-swapping' 
invariants, which express that the probability of any clade containing exactly 
one taxon of a 2-clade on the species tree is unchanged when that taxon is 
swapped out for the other taxon in the 2-clade. 
Corollary 7. (Cherry-swapping invariants) Suppose {a, b} is a 2-clade on a
species tree with taxa X. Then for any $C \subseteq X \setminus \{a, b\}$,

$$\mathbb{P}_\sigma(\{A\} \cup C_g) - \mathbb{P}_\sigma(\{B\} \cup C_g) = 0.$$

To illustrate Theorem 6, we consider next all species tree topologies on 5 or 
fewer taxa, and discuss invariants produced by this construction. For notational 
ease, we denote gene tree clades by juxtaposition of labels, rather than by sets, 
so, for instance {A,B,D,E} will be denoted ABDE. Our focus is on those 
invariants associated to 3- and 4-clades, and we do not explicitly list cherry- 
swapping invariants except for the 3-taxon tree. 

Example. For the species tree topology $\psi$ = ((a,b),c), the cherry-swapping
invariant,

$$\mathbb{P}_\sigma(AC) - \mathbb{P}_\sigma(BC) = 0,$$

is the only one produced by Theorem 6.

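Example. For the 4-taxon caterpillar tree topology $\psi$ = (((a,b),c),d), in
addition to the three cherry-swapping invariants, we find for $\mathcal{A} = \{a, b, c\}$ the
invariants

$$\begin{aligned}
(\mathbb{P}_\sigma(AD) + \mathbb{P}_\sigma(ABD)) - (\mathbb{P}_\sigma(CD) + \mathbb{P}_\sigma(BCD)) &= 0, && (\text{for } \mathcal{A}' = \{b\}) \\
(\mathbb{P}_\sigma(BD) + \mathbb{P}_\sigma(ABD)) - (\mathbb{P}_\sigma(CD) + \mathbb{P}_\sigma(ACD)) &= 0, && (\text{for } \mathcal{A}' = \{a\}) \\
(\mathbb{P}_\sigma(AD) + \mathbb{P}_\sigma(ACD)) - (\mathbb{P}_\sigma(BD) + \mathbb{P}_\sigma(BCD)) &= 0, && (\text{for } \mathcal{A}' = \{c\}).
\end{aligned}$$

We note that there are relations between these: the second invariant is obtained
from the first by a cherry-swapping move, and the third is the sum of two
cherry-swapping invariants.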

For the 4-taxon balanced tree topology $\psi$ = ((a,b),(c,d)), only the six
cherry-swapping invariants are obtained.
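
The construction used in these examples can be sketched in Python. The sketch
assumes the invariant of Theorem 6 has the form suggested by the examples
above: with A' = A \ {a, b}, the sum over all subsets S of A' of the
probabilities of the clades S ∪ {a} ∪ C, minus the corresponding sum with b in
place of a. The function name `theorem6_invariant` and the encoding of clades
as upper-case strings are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def theorem6_invariant(clade, c_set, a, b):
    """Return (plus_terms, minus_terms) for the invariant of (clade, c_set, a, b).

    Assumed form: sum over S subset of A' = clade - {a, b} of
    P(S u {a} u C) - P(S u {b} u C). Clades are returned as sorted
    upper-case strings such as 'ABD'.
    """
    rest = sorted(set(clade) - {a, b})            # A' = A \ {a, b}
    plus, minus = [], []
    for k in range(len(rest) + 1):
        for subset in combinations(rest, k):      # every S subset of A'
            base = set(subset) | set(c_set)
            plus.append("".join(sorted(base | {a})).upper())
            minus.append("".join(sorted(base | {b})).upper())
    return plus, minus


# 4-taxon caterpillar: A = {a, b, c}, A' = {b}, C = {d}
plus, minus = theorem6_invariant("abc", "d", "a", "c")
print(plus, minus)   # ['AD', 'ABD'] ['CD', 'BCD']
```

The printed terms match the first invariant listed for the 4-taxon caterpillar
tree; calling the function with other choices of the pair (a, b) and of C
enumerates the remaining invariants of this type.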

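Example. If $\psi$ is either the 5-taxon caterpillar tree topology ((((a,b),c),d),e),
or the balanced tree topology (((a,b),c),(d,e)), consider $\mathcal{A} = \{a, b, c\}$.
Then for $\mathcal{A}' = \{b\}$, we obtain for various choices of $\mathcal{C}$:

$$\begin{aligned}
(\mathbb{P}_\sigma(AD) + \mathbb{P}_\sigma(ABD)) - (\mathbb{P}_\sigma(CD) + \mathbb{P}_\sigma(BCD)) &= 0, && (6) \\
(\mathbb{P}_\sigma(AE) + \mathbb{P}_\sigma(ABE)) - (\mathbb{P}_\sigma(CE) + \mathbb{P}_\sigma(BCE)) &= 0, && (7) \\
(\mathbb{P}_\sigma(ADE) + \mathbb{P}_\sigma(ABDE)) - (\mathbb{P}_\sigma(CDE) + \mathbb{P}_\sigma(BCDE)) &= 0. && (8)
\end{aligned}$$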

Note that for the balanced species tree, Eq. (7) follows from Eq. (6) by cherry 
swapping D and E. However, for the caterpillar species tree, Eqs. (6) and (7) 
are not related by a cherry swap. 

For A' = {a} with singleton C we obtain Eqs. (6)-(8) again, up to cherry- 
swapping lineages A and B. 

For $\mathcal{A}' = \{c\}$ with singleton $\mathcal{C}$, we obtain invariants such as

$$(\mathbb{P}_\sigma(AD) + \mathbb{P}_\sigma(ACD)) - (\mathbb{P}_\sigma(BD) + \mathbb{P}_\sigma(BCD)) = 0, \tag{9}$$

but Eq. (9) is simply the sum of two cherry-swapping invariants for the cherry
$\{a, b\}$, with $\mathcal{C} = \{d\}$ and $\{c, d\}$. In general, if the taxa in $\mathcal{A} \setminus \mathcal{A}'$ span a smaller
clade than $\mathcal{A}$, the invariant produced will be a sum of invariants for the smaller
clade. Indeed, this phenomenon occurred above, for the 4-taxon caterpillar.

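Example. For the 5-taxon caterpillar topology $\psi$ = ((((a,b),c),d),e), taking
$\mathcal{A} = \{a, b, c, d\}$ and using $\mathcal{A}' = \{b, c\}$ and $\mathcal{A}' = \{a, b\}$, we obtain two invariants:

$$\begin{aligned}
&(\mathbb{P}_\sigma(AE) + \mathbb{P}_\sigma(ABE) + \mathbb{P}_\sigma(ACE) + \mathbb{P}_\sigma(ABCE)) \\
&\quad - (\mathbb{P}_\sigma(DE) + \mathbb{P}_\sigma(BDE) + \mathbb{P}_\sigma(CDE) + \mathbb{P}_\sigma(BCDE)) = 0,
\end{aligned}$$

and

$$\begin{aligned}
&(\mathbb{P}_\sigma(CE) + \mathbb{P}_\sigma(ACE) + \mathbb{P}_\sigma(BCE) + \mathbb{P}_\sigma(ABCE)) \\
&\quad - (\mathbb{P}_\sigma(DE) + \mathbb{P}_\sigma(ADE) + \mathbb{P}_\sigma(BDE) + \mathbb{P}_\sigma(ABDE)) = 0.
\end{aligned}$$

Other choices of $\mathcal{A}'$ give only invariants in the space spanned by those previously
discussed.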

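Example. For the 5-taxon pseudo-caterpillar tree topology $\psi$ = (((a,b),(d,e)),c),
taking $\mathcal{A} = \{a, b, d, e\}$ we obtain

$$\begin{aligned}
&(\mathbb{P}_\sigma(AC) + \mathbb{P}_\sigma(ABC) + \mathbb{P}_\sigma(ACE) + \mathbb{P}_\sigma(ABCE)) \\
&\quad - (\mathbb{P}_\sigma(CD) + \mathbb{P}_\sigma(BCD) + \mathbb{P}_\sigma(CDE) + \mathbb{P}_\sigma(BCDE)) = 0, && (10)
\end{aligned}$$

and three other invariants that can also be obtained by cherry swapping from
Eq. (10). Since $\mathbb{P}_\sigma(ACE) = \mathbb{P}_\sigma(BCD)$ by cherry swapping, two of the eight
terms can be cancelled.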

Remark. All the linear invariants above for 3-, 4-, and 5-taxon trees are, of 
course, among those that can be found computationally. Gröbner basis
calculations do not necessarily produce exactly these, but by cherry-swapping and
taking suitable linear combinations of computed linear invariants, all of these
appear. However, at least for trees on four and five taxa, there are additional
linear invariants beyond the ones of Theorem 6. We give these in Appendix B,
as it would be interesting to have non-computational means of
obtaining them, as well as the higher degree invariants.

6. Identifying clades 

Suppose we are given the clade probabilities $\{\mathbb{P}_\sigma(\mathcal{C}_g)\}$ arising from the
multispecies coalescent on an unknown species tree $\sigma$, and we wish to know if $\sigma$
displays a particular clade. By the results of Section 4, high probability may
identify some clades on $\sigma$. However, it remains to be seen how one might
identify clades on $\sigma$ that have lower probability of occurring on gene trees as a result
of high levels of incomplete lineage sorting.

From Section 5 we know that if $\mathcal{A}$ is a clade on $\sigma$ then for every non-empty
subset $\mathcal{C} \subseteq \mathcal{X} \setminus \mathcal{A}$, and every $a, b \in \mathcal{A}$, the linear invariant associated to $\mathcal{A}$, $\mathcal{C}$, $a$,
and $b$ vanishes. For these invariants to be useful for identifying clades, however,
we must also know that if $\sigma$ does not display the clade $\mathcal{A}$, then one of these
invariants does not vanish.

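Lemma 8. Suppose $\mathcal{A}$ is a non-trivial clade on a species tree $\sigma$, and $a \in \mathcal{A}$ and
$b \in \mathcal{X} \setminus \mathcal{A}$. Let $\mathcal{B}$ denote the set obtained by replacing $a$ with $b$ in $\mathcal{A}$, that is,
$\mathcal{B} = (\mathcal{A} \setminus \{a\}) \cup \{b\}$. Then $\mathbb{P}_\sigma(\mathcal{A}_g) > \mathbb{P}_\sigma(\mathcal{B}_g)$.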

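Proof. Let $v = \mathrm{MRCA}(\mathcal{A} \cup \{b\})$ on $\sigma$. Let $\mathcal{A}' = \mathcal{A} \setminus \{a\}$, so $\mathcal{A} = \{a\} \cup \mathcal{A}'$ and
$\mathcal{B} = \{b\} \cup \mathcal{A}'$.

Then, using phrases such as '$A$ coalesces above $v$' to mean the lineage of $A$
first coalesces with any other gene lineage in a population in the species tree
above the node $v$,

$$\begin{aligned}
\mathbb{P}_\sigma(\mathcal{A}_g) &= \mathbb{P}_\sigma(\{A\} \cup \mathcal{A}'_g) && (11) \\
&= \mathbb{P}_\sigma(\{A\} \cup \mathcal{A}'_g \text{ and } A \text{ coalesces below } v) \\
&\quad + \mathbb{P}_\sigma(\{A\} \cup \mathcal{A}'_g \text{ and } A \text{ coalesces above } v) && (12) \\
&> \mathbb{P}_\sigma(\{A\} \cup \mathcal{A}'_g \text{ and } A \text{ coalesces above } v) && (13) \\
&\geq \mathbb{P}_\sigma(\{A\} \cup \mathcal{A}'_g,\ A \text{ coalesces above } v, \text{ and } B \text{ coalesces above } v) && (14) \\
&= \mathbb{P}_\sigma(\{B\} \cup \mathcal{A}'_g,\ A \text{ coalesces above } v, \text{ and } B \text{ coalesces above } v) && (15) \\
&= \mathbb{P}_\sigma(\{B\} \cup \mathcal{A}'_g) && (16) \\
&= \mathbb{P}_\sigma(\mathcal{B}_g). && (17)
\end{aligned}$$

The equality between lines (14) and (15) is due to exchangeability of lineages;
given any sequence of coalescences in the event of line (14), there is an equally
probable sequence of coalescences in the event of line (15) in which $B$ coalesces
in $A$'s place to form $\{B\} \cup \mathcal{A}'_g$ instead of $\{A\} \cup \mathcal{A}'_g$. Thus, $\mathbb{P}_\sigma(\mathcal{A}_g) > \mathbb{P}_\sigma(\mathcal{B}_g)$. □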

Remark. One might wish to extend the above result to sets obtained by
replacing $k$ elements in a clade with $k$ elements outside it. However, simple examples
show that this is impossible. For instance, if $\sigma$ = (((a,b):x,c):y,(d,e):z) where $z$
is large and both $x$ and $y$ are small, then one can have $\mathbb{P}_\sigma(ABC) < \mathbb{P}_\sigma(CDE)$.
For example, if $(x, y, z) = (0.05, 0.05, 2.0)$, then the highest probability clades
are DE, AB, AC, BC, CDE, and ABC with probabilities 0.889, 0.269, 0.220,
0.220, 0.194, and 0.188, respectively (computed by COAL [7]). Thus for these
branch lengths, we have $\mathbb{P}_\sigma(ABC) < \mathbb{P}_\sigma(CDE)$, and the greedy strategy of
accepting the most probable clades one-at-a-time returns the non-matching tree
((a,b),(c,(d,e))).

The same example shows that for a set $\mathcal{C} \subset \mathcal{X}$ there can exist $y \in \mathcal{C}$ such
that for all $x \in \mathcal{X} \setminus \mathcal{C}$, $\mathbb{P}_\sigma(\mathcal{C}_g) > \mathbb{P}_\sigma(((\mathcal{C} \setminus \{y\}) \cup \{x\})_g)$ and yet $\mathcal{C}$ is not a clade.
In this example, $\{c, d, e\}$ is not a clade on the species tree, yet CDE is more
probable than ADE or BDE on gene trees.
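
The greedy strategy in this Remark can be traced with a short Python sketch.
It uses only the six clade probabilities quoted above (the remaining clades,
all less probable, are omitted), and the function names `compatible` and
`greedy_consensus` are illustrative, not from the paper or from COAL.

```python
def compatible(c1, c2):
    """Two clades are compatible if they are disjoint or one contains the other."""
    s1, s2 = set(c1), set(c2)
    return s1.isdisjoint(s2) or s1 <= s2 or s2 <= s1


def greedy_consensus(clade_probs):
    """Accept clades by decreasing probability, skipping any that conflict."""
    accepted = []
    for clade, _ in sorted(clade_probs.items(), key=lambda kv: -kv[1]):
        if all(compatible(clade, other) for other in accepted):
            accepted.append(clade)
    return accepted


# Probabilities quoted in the Remark for (x, y, z) = (0.05, 0.05, 2.0)
probs = {"DE": 0.889, "AB": 0.269, "AC": 0.220, "BC": 0.220,
         "CDE": 0.194, "ABC": 0.188}
print(greedy_consensus(probs))   # ['DE', 'AB', 'CDE']
```

The accepted clades DE, AB, and CDE are exactly the non-trivial clades of
((a,b),(c,(d,e))); ABC is rejected because it conflicts with the slightly more
probable CDE.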

Lemma 8 allows us to show that if all cherry-swapping invariants are satisfied
for a particular candidate 2-clade, it is in fact a 2-clade on the species tree.

Proposition 9. (Clade probabilities determine species 2-clades.) For $\sigma = (\psi, \lambda)$
an $n$-taxon binary species tree on a set of taxa $\mathcal{X}$, the 2-clades of $\psi$ are
identifiable from clade probabilities. In particular, for any $a, b \in \mathcal{X}$, $\{a, b\}$ is a clade
on $\psi$ if and only if for every $\mathcal{D} \subseteq \mathcal{X} \setminus \{a, b\}$, $\mathbb{P}_\sigma(\{A\} \cup \mathcal{D}_g) = \mathbb{P}_\sigma(\{B\} \cup \mathcal{D}_g)$.

Proof. If $\{a, b\}$ is a clade on $\sigma$, then by Corollary 7, $\mathbb{P}_\sigma(\{A\} \cup \mathcal{D}_g) = \mathbb{P}_\sigma(\{B\} \cup \mathcal{D}_g)$
for any taxon set $\mathcal{D}$ not containing $a$ or $b$.

Suppose now that $\{a, b\}$ is not a clade on $\psi$. Then, because $\sigma$ is binary, at
least one of $a$ or $b$ (let us say $a$) is in a non-trivial clade $\mathcal{C}$ on $\psi$ that excludes
the other. Let $\mathcal{D} = \mathcal{C} \setminus \{a\}$.

By Lemma 8, $\mathbb{P}_\sigma(\{A\} \cup \mathcal{D}_g) \neq \mathbb{P}_\sigma(\{B\} \cup \mathcal{D}_g)$. □
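
The criterion of Proposition 9 can be illustrated on the 4-taxon caterpillar
tree (((a,b):x,c):y,d). The sketch below assumes the caterpillar formulas of
Table A.1, with X = exp(-x) and Y = exp(-y); the values X = 1/2 and Y = 1/3
are arbitrary, and the function names are hypothetical. Exact rational
arithmetic is used so that equality of probabilities can be tested directly.

```python
from fractions import Fraction as F
from itertools import combinations


def caterpillar_clade_probs(X, Y):
    """Clade probabilities for (((a,b):x,c):y,d), assumed as in Table A.1."""
    XY, XY3 = X * Y, X * Y**3
    return {
        "AB": 1 - F(2, 3) * X - F(1, 9) * XY3,
        "AC": F(1, 3) * X - F(1, 9) * XY3,
        "BC": F(1, 3) * X - F(1, 9) * XY3,
        "AD": F(1, 6) * XY + F(1, 18) * XY3,
        "BD": F(1, 6) * XY + F(1, 18) * XY3,
        "CD": F(1, 3) * Y - F(1, 6) * XY + F(1, 18) * XY3,
        "ABC": 1 - F(2, 3) * Y - F(1, 3) * XY + F(1, 6) * XY3,
        "ABD": F(1, 3) * Y - F(1, 6) * XY,
        "ACD": F(1, 6) * XY,
        "BCD": F(1, 6) * XY,
    }


def is_two_clade(a, b, taxa, probs):
    """Check P({A} u D) == P({B} u D) for every non-empty D avoiding a and b."""
    others = [t for t in taxa if t not in (a, b)]
    for k in range(1, len(others) + 1):
        for d in combinations(others, k):
            with_a = "".join(sorted(d + (a,)))
            with_b = "".join(sorted(d + (b,)))
            if probs[with_a] != probs[with_b]:
                return False
    return True


taxa = "ABCD"
probs = caterpillar_clade_probs(F(1, 2), F(1, 3))
cherries = [pair for pair in combinations(taxa, 2)
            if is_two_clade(pair[0], pair[1], taxa, probs)]
print(cherries)   # [('A', 'B')]
```

Only the pair {a, b} passes every cherry-swapping test, in agreement with the
single 2-clade of the caterpillar tree.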

For clades of more than two taxa on a species tree, we obtain a slightly
weaker result: As long as the edge length vector $\lambda$ does not lie in a set of
measure zero, then the clades on the species tree can be identified. The first
step toward this result is the following.

Lemma 10. Let $\psi$ be a species tree topology on $\mathcal{X}$, and $\mathcal{X} = \mathcal{A} \sqcup \mathcal{D}$ a disjoint
union of non-empty subsets with $|\mathcal{A}| \geq 2$. Then if $\mathcal{A}$ is not a clade on $\psi$ and
$\mathrm{MRCA}(\mathcal{A})$ is a binary node, then there exists some $\mathcal{C} \subseteq \mathcal{D}$, $a, b \in \mathcal{A}$, and some
choice of edge lengths $\lambda$ such that the corresponding clade invariant of
Theorem 6 does not vanish on the clade probabilities arising under the multispecies
coalescent on $\sigma = (\psi, \lambda)$.

Proof. Suppose $\mathcal{A}$ is not a clade on $\psi$. Let $v = \mathrm{MRCA}(\mathcal{A})$ on $\psi$, so $\mathcal{E} =
\mathrm{desc}_{\mathcal{X}}(v) \setminus \mathcal{A}$ is non-empty. One or both children nodes $w_1, w_2$ of $v$ have an
element of $\mathcal{E}$ as a descendant, so we may assume $\mathcal{C} = \mathrm{desc}_{\mathcal{X}}(w_1) \cap \mathcal{E}$ is
non-empty. Let $a \in \mathcal{A} \cap \mathrm{desc}_{\mathcal{X}}(w_1)$, $b \in \mathcal{A} \cap \mathrm{desc}_{\mathcal{X}}(w_2)$. Consider the clade invariant
of Theorem 6 associated to $\mathcal{A}$, $\mathcal{C}$, $a$, $b$.

We next give edge lengths $\lambda$ for which this invariant will not vanish at the
clade probabilities arising from the multispecies coalescent on $\sigma = (\psi, \lambda)$. Let
all internal edges of $\psi$ below $v$ have length (near) 0 except the edge $(v, w_1)$
which is assigned length (near) $\infty$. Lengths of edges above $v$ can be fixed at
any finite non-zero values.

With these assignments, the only partition of $\mathrm{desc}_{\mathcal{X}}(v)$ according to lineages
at $v$ that appears with non-negligible probability is that with $\mathrm{desc}_{\mathcal{X}}(w_1)$ forming
one partition set, while all elements of $\mathrm{desc}_{\mathcal{X}}(w_2)$ are in singleton sets. But since
$\mathcal{C} \subseteq \mathrm{desc}_{\mathcal{X}}(w_1)$, $a \in \mathrm{desc}_{\mathcal{X}}(w_1)$ and $b$ is in a singleton set, the only clades that
can result with non-negligible probability that contain both $B$ and elements of
$\mathcal{C}_g$ must also contain $A$. Thus all the clades appearing in the second term of
Eq. (2) have probability arbitrarily close to 0. However, the clade $(\mathrm{desc}_{\mathcal{X}}(w_1))_g$
appears in the first term and has non-negligible probability. Thus Eq. (2) is
violated. □

Theorem 11. Let $\psi$ be a rooted binary species tree topology on $\mathcal{X}$, where $\mathcal{X} =
\mathcal{A} \sqcup \mathcal{D}$ is a disjoint union of non-empty subsets. If $\mathcal{A}$ is not a clade on $\psi$ then
for all choices of edge lengths $\lambda$ except those in some set of measure zero there
exists some $\mathcal{C} \subseteq \mathcal{D}$, $a, b \in \mathcal{A}$, such that the corresponding clade invariant does
not vanish on the clade probabilities arising under the multispecies coalescent on
$\sigma = (\psi, \lambda)$.

Proof. The clade probabilities arising from $\sigma$ can be expressed as polynomials
in the exponentials of the negatives of the interior edge lengths. By Lemma
10, there is an invariant which, when composed with this polynomial map, does
not vanish at some point in the space $(0, 1]^{n-2}$ of these exponentials. But since
this composition is a polynomial, its non-vanishing at some point implies the
set where it vanishes has measure zero in $(0, 1]^{n-2}$. Mapping this set to interior
edge lengths by $-\log(x)$ shows the set of edge lengths for which the invariant
vanishes has measure zero. □
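
As a worked instance of Lemma 10 and Theorem 11, consider the 4-taxon
caterpillar $\psi$ = (((a,b):x,c):y,d) and the non-clade $\mathcal{A} = \{a, b, d\}$, with
$\mathcal{C} = \{c\}$ and the pair $a$, $d$ (so $\mathcal{A}' = \{b\}$). The computation below is an
illustration rather than part of the proof, and it assumes the caterpillar
formulas of Table A.1, with $X = \exp(-x)$ and $Y = \exp(-y)$.

```latex
\begin{align*}
&\bigl(\mathbb{P}_\sigma(AC) + \mathbb{P}_\sigma(ABC)\bigr)
  - \bigl(\mathbb{P}_\sigma(CD) + \mathbb{P}_\sigma(BCD)\bigr) \\
&\quad = \Bigl(\tfrac{1}{3}X - \tfrac{1}{9}XY^3 + 1 - \tfrac{2}{3}Y
          - \tfrac{1}{3}XY + \tfrac{1}{6}XY^3\Bigr)
  - \Bigl(\tfrac{1}{3}Y - \tfrac{1}{6}XY + \tfrac{1}{18}XY^3
          + \tfrac{1}{6}XY\Bigr) \\
&\quad = 1 + \tfrac{1}{3}X - Y - \tfrac{1}{3}XY
  = (1 - Y)\bigl(1 + \tfrac{1}{3}X\bigr).
\end{align*}
```

Under these assumptions the invariant vanishes only when $Y = 1$, that is, when
the internal edge of length $y$ has length 0, which is a set of measure zero in
the space of edge lengths.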

Since, except for a negligible set of edge length parameters, whether a species 
tree has a particular clade can be tested by examining clade probabilities, one 
can similarly determine the full species tree topology. 

Corollary 12. Let $\psi$ be a rooted binary species tree topology on $\mathcal{X}$. For generic
choices of edge lengths $\lambda$, $\psi$ can be identified from the probabilities of clades
under the multispecies coalescent on $\sigma = (\psi, \lambda)$.

Proof. For any subset of taxa $\mathcal{A} \subset \mathcal{X}$, if we find any invariant given by Theorem
6 that fails to vanish on the clade probabilities for $\sigma = (\psi, \lambda)$, then $\mathcal{A}$ is not
a clade on $\psi$. If all such invariants vanish, then by Theorem 11, either $\mathcal{A}$ is a
clade on $\psi$, or $\lambda$ lies in a set of measure zero (which is dependent on $\mathcal{C}$, $a$,
and $b$ used in defining the invariant).

Thus, considering all proper subsets $\mathcal{A}$ of $\mathcal{X}$, we can determine all clades,
unless the edge lengths $\lambda$ lie in a set of measure zero (the finite union of sets of
measure zero for each invariant).

Finally, the clades of $\psi$ determine $\psi$. □

Remark. If one considers a non-binary species tree to be specified by the binary 
tree topologies of some resolution along with the assignment of edge length 0
to any introduced edges, then both Theorem 11 and Corollary 12 still apply.
Indeed, the special choices of some edge lengths form a set of Lebesgue measure 
zero in the full set of possible edge lengths, so regardless of whether such trees 
can be identified, the statements remain valid. 

A particular feature of non-binary species trees that is identifiable is a
$k$-cherry, a set of $k \geq 2$ leaves $\{x_1, \dots, x_k\} \subseteq \mathcal{X}$ that both share a common parent
node and form a clade. This will prove useful for identifying the extended species
trees defined in Section 2, which describes the sampling of multiple individuals
per taxon.

Proposition 13. (Clade probabilities determine extended species tree $k$-cherries.)
Let $\sigma^* = (\psi^*, \lambda, \delta)$ be an extended species tree on $\mathcal{X}$ for which the pruned species
tree $\psi$ is binary. Then the $k$-cherries of $\psi^*$ are identifiable from gene clade
probabilities from the multispecies coalescent on $\sigma^*$ for all choices of edge lengths $\lambda$
outside a set of measure zero.

In particular, $\{x_{i_1 j_1}, \dots, x_{i_k j_k}\} \subseteq \mathcal{X}^*$ is a $k$-cherry on $\psi^*$ if, and only if,
it is a maximal subset of $\mathcal{X}^*$ such that for every $1 \leq l < m \leq k$ and every
$y \in \mathcal{X}^* \setminus \{x_{i_l j_l}, x_{i_m j_m}\}$,

$$\mathbb{P}_{\sigma^*}(\{X_{i_l j_l}, Y\}) = \mathbb{P}_{\sigma^*}(\{X_{i_m j_m}, Y\}).$$

Proof. Let $\mathcal{K} = \{x_{i_1 j_1}, \dots, x_{i_k j_k}\}$ be a $k$-cherry on $\psi^*$, with MRCA the node $v$.
Then for any $y \in \mathcal{X}^* \setminus \{x_{i_l j_l}, x_{i_m j_m}\}$, $\mathbb{P}_{\sigma^*}(\{X_{i_l j_l}, Y\}) = \mathbb{P}_{\sigma^*}(\{X_{i_m j_m}, Y\})$ by the
exchangeability of $X_{i_l j_l}$ and $X_{i_m j_m}$.

To see $\mathcal{K}$ is maximal with respect to this property, suppose $z \in \mathcal{X}^* \setminus \mathcal{K}$. (If
no such $z$ exists, maximality is clear.) We show $\mathcal{K}$ cannot be augmented by $z$
by showing that $\mathbb{P}_{\sigma^*}(\{X_{i_1 j_1}, X_{i_2 j_2}\}) \neq \mathbb{P}_{\sigma^*}(\{Z, X_{i_2 j_2}\})$ for some choice of $\lambda$. This
then implies the same statement for generic values of $\lambda$, since these probabilities
are polynomials in the exponentials of negative branch lengths.

Choose all internal branch lengths of the species tree to be (near) 0 except
for the branch $e$ above $v$, which we choose to have length (near) $\infty$. Consider
the event $E$ that the $X_{i_1 j_1}$ and $X_{i_2 j_2}$ lineages coalesce on $e$ and are the first of
the $\mathcal{K}$ lineages to do so. Then one sees that

$$\mathbb{P}_{\sigma^*}(\{X_{i_1 j_1}, X_{i_2 j_2}\}) \geq \mathbb{P}_{\sigma^*}(E) \approx \binom{k}{2}^{-1},$$

where the approximation becomes increasingly accurate as more extreme branch
lengths are chosen. However, such choices of branch lengths make $\mathbb{P}_{\sigma^*}(\{Z, X_{i_2 j_2}\})$
as close to 0 as desired, since the probability of the clade $\mathcal{K}$ goes to 1, and this is
incompatible with clade $\{Z, X_{i_2 j_2}\}$. Thus $\mathbb{P}_{\sigma^*}(\{X_{i_1 j_1}, X_{i_2 j_2}\}) \neq \mathbb{P}_{\sigma^*}(\{Z, X_{i_2 j_2}\})$.

To establish the converse, suppose now that $\mathcal{K}$ is maximal with respect to
the stated property, but is not a $k$-cherry. By the above argument, maximality
implies $\mathcal{K}$ is not a subset of any $l$-cherry for $l > k$.

To achieve a contradiction, it is sufficient to show that there exist $x_{i_1 j_1}, x_{i_2 j_2} \in
\mathcal{K}$, $y \in \mathcal{X}^*$ such that $\mathbb{P}_{\sigma^*}(\{X_{i_1 j_1}, Y\}) \neq \mathbb{P}_{\sigma^*}(\{X_{i_2 j_2}, Y\})$ unless branch lengths lie
on a set of measure 0. Let $v = \mathrm{MRCA}(\mathcal{K})$. Since $\mathcal{K}$ is not contained in an
$l$-cherry, there exists a non-leaf node $w$ which is a child of $v$. Moreover, $v$ is
binary, since $\psi$ is.

Choose $x_{i_1 j_1} \in \mathcal{K} \setminus \mathrm{desc}_{\mathcal{X}^*}(w)$, which is non-empty because $v = \mathrm{MRCA}(\mathcal{K})$.
The node $w$ has at least two distinct leaf descendants, and since $v$ is binary at
least one leaf descendant of $w$ must be in $\mathcal{K}$. Choose $x_{i_2 j_2} \in \mathcal{K} \cap \mathrm{desc}_{\mathcal{X}^*}(w)$, and
$y \in \mathrm{desc}_{\mathcal{X}^*}(w) \setminus \{x_{i_2 j_2}\}$.

Let $m = |\mathrm{desc}_{\mathcal{X}^*}(w)|$. By choosing the length of the edge connecting $w$
and $v$ to be near $\infty$, and the length of all edges descended from $w$ to be near
0, the probability of clade $\{X_{i_1 j_1}, Y\}$ will be arbitrarily close to 0, and the
probability of the clade $\{X_{i_2 j_2}, Y\}$ will be bounded below by a number
arbitrarily close to $\binom{m}{2}^{-1}$, as in the argument above. Thus, there exist branch
lengths with $\mathbb{P}_{\sigma^*}(\{X_{i_1 j_1}, Y\}) \neq \mathbb{P}_{\sigma^*}(\{X_{i_2 j_2}, Y\})$. Because these probabilities are
polynomials (in transformed branch lengths), the set of branch lengths where
$\mathbb{P}_{\sigma^*}(\{X_{i_1 j_1}, Y\}) = \mathbb{P}_{\sigma^*}(\{X_{i_2 j_2}, Y\})$ has measure 0. □

Finally, we apply Proposition 13 to show generic identifiability of species
trees from clade probabilities when there are $\delta_i \geq 1$ lineages sampled for taxon
$x_i$. (See Fig. 2.)

Corollary 14. Let $\sigma^* = (\psi^*, \lambda, \delta)$ be an extended species tree, for which the
pruned species tree $\psi$ is binary. For generic choices of edge lengths $\lambda$, the
topology of $\psi^*$ can be identified from the probabilities of clades under the multispecies
coalescent.

Proof. All $k$-cherries of $\psi^*$ can be identified by Proposition 13 (although this
is unnecessary if one assumes species assignments are given). By assumption
there are no other polytomies on $\psi^*$; for any other clade $\mathcal{A}$ on the extended
species tree, $v = \mathrm{MRCA}(\mathcal{A})$ is a binary node. Thus Theorem 6 and Lemma 10
apply. These imply that for generic $\lambda$ all clades on the extended species tree
can be identified, and hence $\psi^*$ can be identified. □

Since this corollary does not assume that the assignment of individuals to
taxa is known in advance, it implies that under some circumstances species
assignment can be deduced from clade probabilities. In particular, any taxon for
which three or more individuals are sampled will be identifiable from $\psi^*$.
However, taxa in which two individuals have been sampled will be indistinguishable
from two taxa forming a 2-clade with one individual sampled from each.

7. Discussion 

We have shown that for generic branch lengths on a binary species tree, it 
is possible to identify clades of the species tree, and therefore the species tree 
topology, from probabilities of clades on gene trees. More generally, we showed 
identifiability of clades consisting of taxa descended from binary nodes even if 
the species tree is not itself binary. In addition, we investigated how probable 
a clade on a gene tree must be to infer it is also a clade on the species tree. 

We have not shown the identifiability of branch lengths from clade
probabilities. However, for any given species tree topology it is possible to write
systems of equations of clade probabilities as functions of the branch lengths.
(As examples, consider the systems of equations for clade probabilities for some
4-taxon species trees shown in Table A.1.) These systems are non-linear but
polynomial in the transformed branch lengths. Since the number of branch
length parameters is $n - 2$ for an $n$-taxon tree and there are $2^n - n - 2$
non-trivial clade probabilities, it is reasonable to expect such systems to be solvable,
in principle, for trees of any size. Although for particular small trees these can
be solved, we have not found a general method applicable to arbitrary trees. It
thus remains conjectural that species tree branch lengths are identifiable from
clade probabilities. If multiple individuals are sampled from some species, then
the species tree has additional branch length parameters (for branches leading
to such species), but will have an even larger number of clade probabilities
that could conjecturally be used to estimate branch lengths.
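
The parameter count in this paragraph is easy to tabulate. The following
Python sketch (the function name is illustrative) checks the closed form
$2^n - n - 2$ against a direct count of taxon subsets of size 2 through $n - 1$
and compares it with the $n - 2$ branch length parameters.

```python
from math import comb


def count_nontrivial_clades(n):
    """Number of taxon subsets with between 2 and n - 1 elements."""
    return 2**n - n - 2


for n in range(3, 9):
    direct = sum(comb(n, k) for k in range(2, n))   # sizes 2, ..., n - 1
    assert direct == count_nontrivial_clades(n)
    print(n, n - 2, count_nontrivial_clades(n))     # taxa, parameters, clades
```

For n = 4 this gives 2 branch length parameters against 10 clade
probabilities, matching the ten rows of Table A.1.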

While the invariants of Theorem 6 are useful in proving identifiability of a
species tree topology, they do not immediately indicate a practical way to infer 
the species tree from clade probabilities. In particular, each term in the invariant 
of Theorem 6 is the probability that a random gene tree has a clade that is not 
a clade on the species tree. The clade probabilities needed for the invariant of 
Theorem 6 may therefore be quite small. For species trees with moderately long 
branches, many of these probabilities could be difficult to estimate from finite 
data sets. However, in such a situation the results of Section 4 might offer an 
alternative way of inferring species tree clades as those which occur with high 
frequency on gene trees. This suggests the possibility of a hybrid approach in 
which one accepts highly probable clades as being clades on the species tree, as in 
a greedy algorithm, yet exploits the symmetries of clade probabilities expressed 
by invariants to determine other species clades. Thus our identifiability results
should motivate further research on species tree inference methods that are 
statistically consistent and that can outperform greedy consensus on typical 
data sets with imperfectly estimated clade probabilities. 

Appendix A. Clade probabilities for subtrees as linear combinations

An essential difficulty in dealing with clade probabilities in mathematical 
arguments is that it is not easy to see relationships between probabilities of 
clades on gene trees arising from a species tree on a set of taxa and the clade 
probabilities on the induced gene trees obtained by restricting the set of taxa 
to a smaller set. This frustrates the common approach used to prove results for 
large trees, by inductive arguments on the number of taxa. 
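
Table A.1: Probabilities of clades under three 4-taxon species trees. $X = \exp(-x)$, $Y = \exp(-y)$.

$$\begin{array}{llll}
 & \multicolumn{3}{c}{\text{probability under species tree}} \\
\text{clade} & \text{(((a,b):x,c):y,d)} & \text{((a,d):x,(b,c):y)} & \text{((a,b):x,(c,d):y)} \\
\hline
c_1 = \mathbb{P}_\sigma(AB) & 1 - \frac{2}{3}X - \frac{1}{9}XY^3 & \frac{2}{9}XY & 1 - \frac{2}{3}X - \frac{1}{9}XY \\
c_2 = \mathbb{P}_\sigma(AC) & \frac{1}{3}X - \frac{1}{9}XY^3 & \frac{2}{9}XY & \frac{2}{9}XY \\
c_3 = \mathbb{P}_\sigma(AD) & \frac{1}{6}XY + \frac{1}{18}XY^3 & 1 - \frac{2}{3}X - \frac{1}{9}XY & \frac{2}{9}XY \\
c_4 = \mathbb{P}_\sigma(BC) & \frac{1}{3}X - \frac{1}{9}XY^3 & 1 - \frac{2}{3}Y - \frac{1}{9}XY & \frac{2}{9}XY \\
c_5 = \mathbb{P}_\sigma(BD) & \frac{1}{6}XY + \frac{1}{18}XY^3 & \frac{2}{9}XY & \frac{2}{9}XY \\
c_6 = \mathbb{P}_\sigma(CD) & \frac{1}{3}Y - \frac{1}{6}XY + \frac{1}{18}XY^3 & \frac{2}{9}XY & 1 - \frac{2}{3}Y - \frac{1}{9}XY \\
c_7 = \mathbb{P}_\sigma(ABC) & 1 - \frac{2}{3}Y - \frac{1}{3}XY + \frac{1}{6}XY^3 & \frac{1}{3}X - \frac{1}{6}XY & \frac{1}{3}Y - \frac{1}{6}XY \\
c_8 = \mathbb{P}_\sigma(ABD) & \frac{1}{3}Y - \frac{1}{6}XY & \frac{1}{3}Y - \frac{1}{6}XY & \frac{1}{3}Y - \frac{1}{6}XY \\
c_9 = \mathbb{P}_\sigma(ACD) & \frac{1}{6}XY & \frac{1}{3}Y - \frac{1}{6}XY & \frac{1}{3}X - \frac{1}{6}XY \\
c_{10} = \mathbb{P}_\sigma(BCD) & \frac{1}{6}XY & \frac{1}{3}X - \frac{1}{6}XY & \frac{1}{3}X - \frac{1}{6}XY
\end{array}$$
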

Consider a set of taxa $\mathcal{X}$, a proper subset $\mathcal{Y} \subset \mathcal{X}$, and a species tree $\sigma =
(\psi, \lambda)$ on $\mathcal{X}$. We show here that, in general, probabilities of gene tree clades
for the induced species tree on $\mathcal{Y}$, $\sigma|_{\mathcal{Y}}$, cannot be written as the same linear
combination of gene tree clade probabilities for $\sigma$ for all choices of $\psi$. Thus
there is no linear formula for the clade probabilities for the smaller taxon set
that does not depend on the species tree topology.

This is in contrast to, for example, gene tree probabilities for $\sigma|_{\mathcal{Y}}$, which
can be written as linear combinations of gene tree probabilities for $\sigma$, where
the weight assigned to each gene tree probability in the combination has no
dependency on $\sigma$.

To show that the same linear combination cannot be used to marginalize
clade probabilities independently of the species tree, we consider three species
tree topologies on four taxa, as given in Table A.1: (((a,b),c),d), ((a,d),(b,c)),
and ((a,b),(c,d)). There are 10 non-trivial clades, so any linear combination of
clade probabilities $c_1, \dots, c_{10}$ has the form

$$\sum_{i=1}^{10} \alpha_i c_i \tag{A.1}$$

for some $\alpha_1, \dots, \alpha_{10}$.

We consider obtaining the probability of clade CD when the taxon set is
restricted to $\{a, c, d\}$ (i.e., marginalizing over taxon $b$). Assuming first that the
species tree is ((a,d):x,(b,c):y), the restricted species tree is ((a,d):x,c), and the
probability of clade CD is $\frac{1}{3}X$. If a linear combination of the clade probabilities
on the larger tree is to yield this probability, then by inserting the formulas for
the $c_i$ from Table A.1 into Eq. (A.1) and equating coefficients, we obtain the
following equations:

$$\begin{aligned}
\alpha_3 + \alpha_4 &= 0, \\
-2\alpha_3 + \alpha_7 + \alpha_{10} &= 1, \\
-2\alpha_4 + \alpha_8 + \alpha_9 &= 0, \\
4\alpha_1 + 4\alpha_2 - 2\alpha_3 - 2\alpha_4 + 4\alpha_5 + 4\alpha_6 - 3\alpha_7 - 3\alpha_8 - 3\alpha_9 - 3\alpha_{10} &= 0,
\end{aligned}$$

where the rows correspond to the coefficients of 1, $X$, $Y$, and $XY$. The system
is underdetermined since there are 10 unknowns and only four equations.

Similar systems can be obtained by considering other species trees. For the
other species trees in Table A.1, (((a,b):x,c):y,d) and ((a,b):x,(c,d):y),
respectively, restricting to taxa $\{a, c, d\}$ leads to trees ((a,c):y,d) and (a,(c,d):y), and
probabilities of clade CD that are $\frac{1}{3}Y$ and $1 - \frac{2}{3}Y$. Equating coefficients on all
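three species trees in Table A.1, we have the equations encoded by the following
13 × 11 augmented matrix (each row scaled to clear denominators):

$$\left(\begin{array}{rrrrrrrrrr|r}
 1 &  0 &  0 &  0 &  0 &  0 &  1 &  0 &  0 &  0 &  0 \\
-2 &  1 &  0 &  1 &  0 &  0 &  0 &  0 &  0 &  0 &  0 \\
 0 &  0 &  0 &  0 &  0 &  1 & -2 &  1 &  0 &  0 &  1 \\
 0 &  0 &  1 &  0 &  1 & -1 & -2 & -1 &  1 &  1 &  0 \\
-2 & -2 &  1 & -2 &  1 &  1 &  3 &  0 &  0 &  0 &  0 \\
 0 &  0 &  1 &  1 &  0 &  0 &  0 &  0 &  0 &  0 &  0 \\
 0 &  0 & -2 &  0 &  0 &  0 &  1 &  0 &  0 &  1 &  1 \\
 0 &  0 &  0 & -2 &  0 &  0 &  0 &  1 &  1 &  0 &  0 \\
 4 &  4 & -2 & -2 &  4 &  4 & -3 & -3 & -3 & -3 &  0 \\
 1 &  0 &  0 &  0 &  0 &  1 &  0 &  0 &  0 &  0 &  1 \\
-2 &  0 &  0 &  0 &  0 &  0 &  0 &  0 &  1 &  1 &  0 \\
 0 &  0 &  0 &  0 &  0 & -2 &  1 &  1 &  0 &  0 & -2 \\
-2 &  4 &  4 &  4 &  4 & -2 & -3 & -3 & -3 & -3 &  0
\end{array}\right)$$

Here rows 1-5 represent the system of equations implied by the species tree
(((a,b),c),d), rows 6-9 represent the system of equations corresponding to the
species tree ((a,d),(b,c)), and rows 10-13 represent the system of equations
corresponding to the species tree ((a,b),(c,d)). Gaussian elimination shows
this system of 13 equations is inconsistent.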
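
The inconsistency can be checked mechanically. The Python sketch below
rebuilds a 13 × 11 augmented system from the polynomials of Table A.1 (assumed
as given there, with rows left unscaled) and compares the rank of the
coefficient matrix with that of the augmented matrix using exact rational
Gaussian elimination. The helper names `big`, `tri`, `TWO`, and `rank` are
illustrative only.

```python
from fractions import Fraction as F


def big(m):   # cherry of a balanced tree sitting on the branch with monomial m
    return {"1": F(1), m: F(-2, 3), "XY": F(-1, 9)}


def tri(m):   # 3-clade on a balanced tree
    return {m: F(1, 3), "XY": F(-1, 6)}


TWO = {"XY": F(2, 9)}   # 2-clade that is not a cherry of a balanced tree

# Each entry: (monomials, polynomials for c_1..c_10 in the order
# AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, target P(CD) on {a, c, d}).
TREES = [
    (["1", "X", "Y", "XY", "XY3"],                      # (((a,b):x,c):y,d)
     [{"1": F(1), "X": F(-2, 3), "XY3": F(-1, 9)},
      {"X": F(1, 3), "XY3": F(-1, 9)},
      {"XY": F(1, 6), "XY3": F(1, 18)},
      {"X": F(1, 3), "XY3": F(-1, 9)},
      {"XY": F(1, 6), "XY3": F(1, 18)},
      {"Y": F(1, 3), "XY": F(-1, 6), "XY3": F(1, 18)},
      {"1": F(1), "Y": F(-2, 3), "XY": F(-1, 3), "XY3": F(1, 6)},
      {"Y": F(1, 3), "XY": F(-1, 6)},
      {"XY": F(1, 6)},
      {"XY": F(1, 6)}],
     {"Y": F(1, 3)}),
    (["1", "X", "Y", "XY"],                             # ((a,d):x,(b,c):y)
     [TWO, TWO, big("X"), big("Y"), TWO, TWO,
      tri("X"), tri("Y"), tri("Y"), tri("X")],
     {"X": F(1, 3)}),
    (["1", "X", "Y", "XY"],                             # ((a,b):x,(c,d):y)
     [big("X"), TWO, TWO, TWO, TWO, big("Y"),
      tri("Y"), tri("Y"), tri("X"), tri("X")],
     {"1": F(1), "Y": F(-2, 3)}),
]

coef_rows, aug_rows = [], []
for monomials, polys, target in TREES:
    for m in monomials:                 # one equation per monomial
        row = [p.get(m, F(0)) for p in polys]
        coef_rows.append(row)
        aug_rows.append(row + [target.get(m, F(0))])


def rank(rows):
    """Rank of a matrix of Fractions by Gauss-Jordan elimination."""
    rows = [list(r) for r in rows]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# The system is inconsistent exactly when appending the right-hand side
# raises the rank.
print(rank(coef_rows) < rank(aug_rows))   # True
```

A rank increase from the coefficient matrix to the augmented matrix means no
choice of $\alpha_1, \dots, \alpha_{10}$ satisfies all 13 equations at once.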

Appendix B. Additional clade invariants for small trees 

For trees on five or fewer taxa, computations of a Gröbner basis for invariants
in clade probabilities show that the construction of Theorem 6 fails to produce 
all invariants, or even all linear ones. In this appendix, we indicate the results 
of such computations that we performed using the software Singular [11]. We 
emphasize that by linear invariant we mean linear homogeneous invariant, so 
that the trivial invariant, which is inhomogeneous, is not counted when we give 
dimensions of spaces. 

For the 3-taxon tree there is only a single invariant, the linear one arising
from cherry-swapping, produced by Theorem 6.

For the 4-taxon balanced tree topology ((a,b),(c,d)), there is a 6-dimensional
space of linear invariants, yet the ones constructed in Theorem 6 span only a
5-dimensional subspace. The additional generator needed to obtain all linear
invariants can be taken to be

$$\mathbb{P}_\sigma(AB) - \mathbb{P}_\sigma(CD) - 2\mathbb{P}_\sigma(ABC) + 2\mathbb{P}_\sigma(ACD).$$

The ideal of all invariants has just one additional generator, which is quadratic.

For the 4-taxon caterpillar tree topology (((a,b),c),d), there is a 5-dimensional
space of linear invariants. However, the construction of Theorem 6 produces only
a 4-dimensional space of linear invariants. For the full space of linear invariants,
the polynomial

$$\mathbb{P}_\sigma(AB) + 2\mathbb{P}_\sigma(AC) + 9\mathbb{P}_\sigma(CD) - \mathbb{P}_\sigma(ABC) - 11\mathbb{P}_\sigma(ABD) - 4\mathbb{P}_\sigma(ACD)$$

can be taken as the missing generator.

In addition, there were one quadratic and three cubic polynomials in a full
Gröbner basis.
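
Both additional 4-taxon generators can be checked against the clade
probability formulas. The Python sketch below assumes the formulas of Table
A.1 and reads the generator coefficients as (1, -1, -2, 2) for the balanced
tree and (1, 2, 9, -1, -11, -4) for the caterpillar tree; the evaluation
points and all function names are arbitrary illustrative choices. Exact
rational arithmetic is used so that vanishing is tested exactly.

```python
from fractions import Fraction as F


def balanced_probs(X, Y):
    """Clade probabilities needed for ((a,b):x,(c,d):y)."""
    XY = X * Y
    return {
        "AB": 1 - F(2, 3) * X - F(1, 9) * XY,
        "CD": 1 - F(2, 3) * Y - F(1, 9) * XY,
        "ABC": F(1, 3) * Y - F(1, 6) * XY,
        "ACD": F(1, 3) * X - F(1, 6) * XY,
    }


def caterpillar_probs(X, Y):
    """Clade probabilities needed for (((a,b):x,c):y,d)."""
    XY, XY3 = X * Y, X * Y**3
    return {
        "AB": 1 - F(2, 3) * X - F(1, 9) * XY3,
        "AC": F(1, 3) * X - F(1, 9) * XY3,
        "CD": F(1, 3) * Y - F(1, 6) * XY + F(1, 18) * XY3,
        "ABC": 1 - F(2, 3) * Y - F(1, 3) * XY + F(1, 6) * XY3,
        "ABD": F(1, 3) * Y - F(1, 6) * XY,
        "ACD": F(1, 6) * XY,
    }


def evaluate(coeffs, probs):
    """Value of the linear form sum_i coeff_i * P(clade_i)."""
    return sum(c * probs[clade] for clade, c in coeffs.items())


BALANCED_EXTRA = {"AB": 1, "CD": -1, "ABC": -2, "ACD": 2}
CATERPILLAR_EXTRA = {"AB": 1, "AC": 2, "CD": 9,
                     "ABC": -1, "ABD": -11, "ACD": -4}

points = [(F(1, 2), F(1, 3)), (F(3, 4), F(1, 5)), (F(1, 10), F(9, 10))]
values = [(evaluate(BALANCED_EXTRA, balanced_probs(X, Y)),
           evaluate(CATERPILLAR_EXTRA, caterpillar_probs(X, Y)))
          for X, Y in points]
print(all(b == 0 and c == 0 for b, c in values))   # True
```

Vanishing at a few sample points is only a spot check; collecting
coefficients of 1, X, Y, XY, and XY^3 by hand confirms that both linear forms
vanish identically under the assumed formulas.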

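For the 5-taxon balanced tree topology (((a,b),c),(d,e)), the construction of
Theorem 6 produces a 14-dimensional subspace within a 16-dimensional space
of linear invariants. Additional generators can be taken to be

$$\begin{aligned}
&22\mathbb{P}_\sigma(CD) + 5\mathbb{P}_\sigma(DE) - 5\mathbb{P}_\sigma(ABC) - 22\mathbb{P}_\sigma(ABD) + 15\mathbb{P}_\sigma(CDE) \\
&\quad + 10\mathbb{P}_\sigma(ABCD) - 25\mathbb{P}_\sigma(ABDE) - 20\mathbb{P}_\sigma(ACDE)
\end{aligned}$$

and

$$\begin{aligned}
&11\mathbb{P}_\sigma(AB) + 22\mathbb{P}_\sigma(AC) - 25\mathbb{P}_\sigma(DE) + 14\mathbb{P}_\sigma(ABC) - 22\mathbb{P}_\sigma(ABD) - 44\mathbb{P}_\sigma(ACD) \\
&\quad + 24\mathbb{P}_\sigma(CDE) - 50\mathbb{P}_\sigma(ABCD) + 4\mathbb{P}_\sigma(ABDE) + 56\mathbb{P}_\sigma(ACDE).
\end{aligned}$$

In addition to the linear invariants, there are eight quadratic invariants and
13 cubic invariants in a Gröbner basis for the ideal.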

For the 5-taxon pseudo-caterpillar tree topology (((a,b),(d,e)),c), the
construction of Theorem 6 produces a 13-dimensional subspace within a 14-dimensional
space of linear invariants. An additional generator can be taken to be

$$\mathbb{P}_\sigma(AB) - \mathbb{P}_\sigma(DE) - 6\mathbb{P}_\sigma(ABC) - 2\mathbb{P}_\sigma(ABD) + 2\mathbb{P}_\sigma(ADE) + 6\mathbb{P}_\sigma(CDE).$$

The algorithm for computing the full ideal of invariants for this topology did not
terminate in a reasonable amount of time, so the full ideal remains unknown.
Partial computations in which the degree of generators is bounded show that
there are generators in degrees 2, 3, 4, 5, and 6, in addition to linear invariants.

For the 5-taxon caterpillar tree ((((a,b),c),d),e), the construction above
produces an 11-dimensional subspace within a 12-dimensional space of linear
invariants. One choice for the additional generator is

$$\begin{aligned}
&5\mathbb{P}_\sigma(AB) + 10\mathbb{P}_\sigma(AC) + 24\mathbb{P}_\sigma(CD) + 62\mathbb{P}_\sigma(DE) + 2\mathbb{P}_\sigma(ABC) - 20\mathbb{P}_\sigma(ABD) \\
&\quad - 29\mathbb{P}_\sigma(ABE) + 8\mathbb{P}_\sigma(ACD) - 58\mathbb{P}_\sigma(ACE) + 45\mathbb{P}_\sigma(CDE) - 7\mathbb{P}_\sigma(ABCD) \\
&\quad - 76\mathbb{P}_\sigma(ABCE) - 44\mathbb{P}_\sigma(ABDE) + 2\mathbb{P}_\sigma(ACDE).
\end{aligned}$$

Our attempt to compute a Gröbner basis for the caterpillar topology did not
terminate in a reasonable amount of time. We did, however, find quadratic
generators in addition to the linear ones, but found no higher degree generators.
It is reasonable to speculate that the full ideal is generated in degree one and
two for this topology.

It would be quite interesting to find general constructions that lead to the 
additional linear invariants not explained by Theorem 6. Similarly, understand- 
ing the structure of higher degree invariants by non-computational means is an 
open challenge. 